    Estimation of the Rate-Distortion Function

    Motivated by questions in lossy data compression and by theoretical considerations, we examine the problem of estimating the rate-distortion function of an unknown (not necessarily discrete-valued) source from empirical data. Our focus is the behavior of the so-called "plug-in" estimator, which is simply the rate-distortion function of the empirical distribution of the observed data. Sufficient conditions are given for its consistency, and examples are provided to demonstrate that in certain cases it fails to converge to the true rate-distortion function. The analysis of its performance is complicated by the fact that the rate-distortion function is not continuous in the source distribution; the underlying mathematical problem is closely related to the classical problem of establishing the consistency of maximum likelihood estimators. General consistency results are given for the plug-in estimator applied to a broad class of sources, including all stationary and ergodic ones. A more general class of estimation problems is also considered, arising in the context of lossy data compression when the allowed class of coding distributions is restricted; analogous results are developed for the plug-in estimator in that case. Finally, consistency theorems are formulated for modified (e.g., penalized) versions of the plug-in, and for estimating the optimal reproduction distribution.

    Comment: 18 pages, no figures. [v2: removed an example with an error; corrected typos; a shortened version will appear in IEEE Trans. Inform. Theory.]
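    For a discrete reproduction alphabet, the plug-in idea can be sketched numerically: form the empirical distribution of the samples and compute its rate-distortion function with the Blahut-Arimoto algorithm. The alphabet, squared-error distortion, and slope parameter below are illustrative assumptions, not choices made in the paper; this is a minimal sketch of the estimator, not the paper's analysis.

    # Minimal sketch of the "plug-in" rate-distortion estimator (illustrative
    # alphabet, distortion, and slope; not taken from the paper).
    import numpy as np

    def plugin_rate_distortion(samples, x_hat, beta, n_iter=500):
        """Rate (nats) and distortion of the empirical source at slope beta."""
        src, counts = np.unique(samples, return_counts=True)
        p = counts / counts.sum()                    # empirical distribution
        d = (src[:, None] - x_hat[None, :]) ** 2     # squared-error distortion
        q = np.full(len(x_hat), 1.0 / len(x_hat))    # initial output marginal
        for _ in range(n_iter):
            # test channel achieving the current Blahut-Arimoto bound
            w = q[None, :] * np.exp(-beta * d)
            w /= w.sum(axis=1, keepdims=True)
            q = p @ w                                # induced output marginal
        rate = np.sum(p[:, None] * w * np.log(w / q[None, :]))
        distortion = np.sum(p[:, None] * w * d)
        return rate, distortion

    # Usage: estimate a point on R(D) of an unknown source from 10,000 samples.
    rng = np.random.default_rng(0)
    data = rng.normal(size=10_000).round(1)          # quantized Gaussian samples
    R, D = plugin_rate_distortion(data, x_hat=np.linspace(-3, 3, 25), beta=2.0)
    print(f"plug-in estimate: R = {R:.3f} nats at D = {D:.3f}")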

    Identifying statistical dependence in genomic sequences via mutual information estimates

    Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.

    Comment: Preliminary version. Final version in EURASIP Journal on Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
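    As a minimal sketch of the underlying computation, one can estimate the empirical mutual information between two aligned segments and compare it against a shuffled null. The permutation-based significance test below is a generic stand-in, not the threshold function defined in the paper.

    # Minimal sketch: flag statistical dependence between two equal-length
    # segments via empirical mutual information (permutation threshold is a
    # generic stand-in for the paper's threshold function).
    from collections import Counter
    from math import log
    import random

    def mutual_information(xs, ys):
        """Empirical MI (bits) between paired symbols of two segments."""
        n = len(xs)
        pxy = Counter(zip(xs, ys))
        px, py = Counter(xs), Counter(ys)
        return sum(c / n * log((c * n) / (px[x] * py[y]), 2)
                   for (x, y), c in pxy.items())

    def dependent(xs, ys, trials=1000, alpha=0.01):
        """Declare dependence if observed MI beats the permutation null."""
        observed = mutual_information(xs, ys)
        ys = list(ys)
        exceed = 0
        for _ in range(trials):
            random.shuffle(ys)             # break any real dependence
            if mutual_information(xs, ys) >= observed:
                exceed += 1
        return exceed / trials < alpha     # small p-value => dependent

    seg1 = "ACGTACGTACGTACGTACGT" * 5
    seg2 = seg1[:-1] + "A"                 # near-copy: strongly dependent
    print(dependent(seg1, seg2))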

    Geometric ergodicity in a weighted Sobolev space

    For a discrete-time Markov chain $\{X(t)\}$ evolving on $\Re^\ell$ with transition kernel $P$, natural, general conditions are developed under which the following are established:

    1. The transition kernel $P$ has a purely discrete spectrum when viewed as a linear operator on a weighted Sobolev space $L_\infty^{v,1}$ of functions with norm
    $$\|f\|_{v,1} = \sup_{x \in \Re^\ell} \frac{1}{v(x)} \max\{|f(x)|, |\partial_1 f(x)|, \ldots, |\partial_\ell f(x)|\},$$
    where $v\colon \Re^\ell \to [1,\infty)$ is a Lyapunov function and $\partial_i := \partial/\partial x_i$.

    2. The Markov chain is geometrically ergodic in $L_\infty^{v,1}$: there is a unique invariant probability measure $\pi$ and constants $B<\infty$ and $\delta>0$ such that, for each $f \in L_\infty^{v,1}$, any initial condition $X(0)=x$, and all $t \ge 0$,
    $$\big|\mathrm{E}_x[f(X(t))] - \pi(f)\big| \le B e^{-\delta t} v(x), \qquad \big\|\nabla \mathrm{E}_x[f(X(t))]\big\|_2 \le B e^{-\delta t} v(x),$$
    where $\pi(f) = \int f \, d\pi$.

    3. For any function $f \in L_\infty^{v,1}$ there is a function $h \in L_\infty^{v,1}$ solving Poisson's equation:
    $$h - Ph = f - \pi(f).$$

    Part of the analysis is based on an operator-theoretic treatment of the sensitivity process that appears in the theory of Lyapunov exponents.
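    As a minimal worked example of conclusions 2 and 3 (the chain, the function $f$, and the Lyapunov function below are illustrative choices, not taken from the paper), consider the scalar linear chain $X(t+1) = aX(t) + W(t)$ with $|a| < 1$ and i.i.d. zero-mean noise $W(t)$. Taking $f(x) = x$, so that $\pi(f) = 0$,
    $$\mathrm{E}_x[f(X(t))] = a^t x \quad\Longrightarrow\quad \big|\mathrm{E}_x[f(X(t))] - \pi(f)\big| = |a|^t\,|x| \le e^{-\delta t} v(x),$$
    with $\delta = \log(1/|a|)$ and $v(x) = 1 + |x|$, and likewise $\|\nabla \mathrm{E}_x[f(X(t))]\|_2 = |a|^t \le e^{-\delta t} v(x)$. Poisson's equation is solved by $h(x) = x/(1-a)$, since $(Ph)(x) = \mathrm{E}[h(ax + W)] = ax/(1-a)$, giving $h - Ph = x = f - \pi(f)$.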

    Compound Poisson approximation via information functionals

    An information-theoretic development is given for the problem of compound Poisson approximation, which parallels earlier treatments for Gaussian and Poisson approximation. Nonasymptotic bounds are derived for the distance between the distribution of a sum of independent integer-valued random variables and an appropriately chosen compound Poisson law. In the case where all summands have the same conditional distribution given that they are non-zero, a bound on the relative entropy distance between their sum and the compound Poisson distribution is derived, based on the data-processing property of relative entropy and earlier Poisson approximation results. When the summands have arbitrary distributions, corresponding bounds are derived in terms of the total variation distance. The main technical ingredient is the introduction of two "information functionals", and the analysis of their properties. These information functionals play a role analogous to that of the classical Fisher information in normal approximation. Detailed comparisons are made between the resulting inequalities and related bounds.
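    A minimal numerical sketch of the object being bounded (the summand laws and jump distribution below are illustrative, not from the paper): compute the exact law of a sum of independent integer-valued variables by convolution, the approximating compound Poisson law by the Panjer recursion, and their total variation distance.

    # Minimal sketch: d_TV between a sum of independent B_i*Y_i summands
    # (B_i ~ Bernoulli(p_i), jumps Y_i ~ Q) and the compound Poisson law
    # CP(sum(p_i), Q). All numerical choices here are illustrative.
    import numpy as np

    def sum_pmf(ps, q):
        """Exact pmf of the sum of independent B_i*Y_i by convolution."""
        pmf = np.array([1.0])
        for p in ps:
            summand = np.concatenate(([1.0 - p], p * q[1:]))  # pmf of B_i*Y_i
            pmf = np.convolve(pmf, summand)
        return pmf

    def compound_poisson_pmf(lam, q, n_max):
        """CP(lam, Q) pmf on {0,...,n_max} via the Panjer recursion."""
        g = np.zeros(n_max + 1)
        g[0] = np.exp(-lam)
        for n in range(1, n_max + 1):
            js = np.arange(1, min(n, len(q) - 1) + 1)
            g[n] = lam / n * np.sum(js * q[js] * g[n - js])
        return g

    ps = np.full(50, 0.05)                  # P(B_i = 1), i = 1..50
    q = np.array([0.0, 0.7, 0.3])           # jump law Q on {1, 2}
    exact = sum_pmf(ps, q)
    cp = compound_poisson_pmf(ps.sum(), q, len(exact) - 1)
    tv = 0.5 * np.abs(exact - cp).sum()     # TV distance (CP tail truncated)
    print(f"d_TV(sum, compound Poisson) ~ {tv:.5f}")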